AITopics | test case

Collaborating Authors

test case

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

c1f0b856a35986348ab3414177266f75-Paper-Conference.pdf

Neural Information Processing SystemsFeb-16-2026, 22:29:19 GMT

Large language models are now tuned to align with the goals of their creators, namely to be "helpful and harmless." These models should respond helpfully to user questions, but refuse to answer requests that could cause harm. However, adversarial users can construct inputs which circumvent attempts at alignment. In this work, we study adversarial alignment, and ask to what extent these models remain aligned when interacting with an adversarial user who constructs worst-case inputs (adversarial examples). These inputs are designed to cause the model to emit harmful content that would otherwise be prohibited. We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models: even when current NLP-based attacks fail, we can find adversarial inputs with brute force.

large language model, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

North America > United States (0.14)
Europe > Switzerland > Zürich > Zürich (0.04)
Europe > Germany > Baden-Württemberg > Karlsruhe Region > Heidelberg (0.04)
(2 more...)

Genre: Research Report (0.46)

Industry:

Information Technology > Security & Privacy (0.93)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

abe1eb21ceb046209c96a0f5e7544ccc-Paper-Conference.pdf

Neural Information Processing SystemsFeb-16-2026, 11:44:36 GMT

large language model, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Santa Barbara County > Santa Barbara (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

819cebb05f993840e8a52d7564c5c282-Paper-Conference.pdf

Neural Information Processing SystemsFeb-15-2026, 13:35:58 GMT

large language model, machine learning, programming language, (21 more...)

Neural Information Processing Systems

Country:

North America > United States > Wisconsin > Dane County > Madison (0.04)
Europe > Netherlands > South Holland > Delft (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Software Engineering (0.95)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
(2 more...)

Add feedback

7a92bcdede88c7afd108072faf5485c8-Paper-Datasets_and_Benchmarks.pdf

Neural Information Processing SystemsFeb-15-2026, 07:09:53 GMT

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

North America > United States > Arizona (0.05)
Asia > Middle East > Jordan (0.04)
North America > United States > Colorado (0.04)
(2 more...)

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.94)

Add feedback

Code Repair with LLMs gives an Exploration-Exploitation Tradeoff

Neural Information Processing SystemsDec-27-2025, 09:15:48 GMT

Iteratively improving and repairing source code with large language models (LLMs), known as refinement, has emerged as a popular way of generating programs that would be too complex to construct in one shot. Given a bank of test cases, together with a candidate program, an LLM can improve that program by being prompted with failed test cases. But it remains an open question how to best iteratively refine code, with prior work employing simple greedy or breadth-first strategies. We show here that refinement exposes an explore-exploit tradeoff: exploit by refining the program that passes the most test cases, or explore by refining a lesser considered program. We frame this as an arm-acquiring bandit problem, which we solve with Thompson Sampling. The resulting LLM-based program synthesis algorithm is broadly applicable: Across loop invariant synthesis, visual reasoning puzzles, and competition programming problems, we find that our new method can solve more problems using fewer language model calls.

artificial intelligence, large language model, natural language, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents

Neural Information Processing SystemsDec-26-2025, 15:40:27 GMT

Rigorous software testing is crucial for developing and maintaining high-quality code, making automated test generation a promising avenue for both improving software quality and boosting the effectiveness of code generation methods. However, while code generation with Large Language Models (LLMs) is an extraordinarily active research area, test generation remains relatively unexplored. We address this gap and investigate the capability of LLM-based Code Agents to formalize user issues into test cases. To this end, we propose a novel benchmark based on popular GitHub repositories, containing real-world issues, ground-truth bug-fixes, and golden tests. We find that LLMs generally perform surprisingly well at generating relevant test cases, with Code Agents designed for code repair exceeding the performance of systems designed specifically for test generation. Further, as test generation is a similar but more structured task than code generation, it allows for a more fine-grained analysis using issue reproduction rate and coverage changes, providing a dual metric for analyzing systems designed for code repair. Finally, we find that generated tests are an effective filter for proposed code fixes, doubling the precision of SWE-Agent.

artificial intelligence, large language model, natural language, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

SPoC: Search-based Pseudocode to Code

Neural Information Processing SystemsDec-25-2025, 13:48:03 GMT

We consider the task of mapping pseudocode to executable code, assuming a one-to-one correspondence between lines of pseudocode and lines of code. Given test cases as a mechanism to validate programs, we search over the space of possible translations of the pseudocode to find a program that compiles and passes the test cases. While performing a best-first search, compilation errors constitute 88.7% of program failures. To better guide this search, we learn to predict the line of the program responsible for the failure and focus search over alternative translations of the pseudocode for that line. For evaluation, we collected the SPoC dataset (Search-based Pseudocode to Code) containing 18,356 C++ programs with human-authored pseudocode and test cases. Under a budget of 100 program compilations, performing search improves the synthesis success rate over using the top-one translation of the pseudocode from 25.6% to 44.7%.

artificial intelligence, name change, proceedings, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.64)

Add feedback

IRM--when it works and when it doesn't: A test case of natural language inference

Neural Information Processing SystemsDec-24-2025, 13:07:29 GMT

Invariant Risk Minimization (IRM) is a recently proposed framework for out-of-distribution (o.o.d) generalization. Most of the studies on IRM so far have focused on theoretical results, toy problems, and simple models. In this work, we investigate the applicability of IRM to bias mitigation-a special case of o.o.d generalization-in increasingly naturalistic settings and deep models. Using natural language inference (NLI) as a test case, we start with a setting where both the dataset and the bias are synthetic, continue with a natural dataset and synthetic bias, and end with a fully realistic setting with natural datasets and bias. Our results show that in naturalistic settings, learning complex features in place of the bias proves to be difficult, leading to a rather small improvement over empirical risk minimization. Moreover, we find that in addition to being sensitive to random seeds, the performance of IRM also depends on several critical factors, notably dataset size, bias prevalence, and bias strength, thus limiting IRM's advantage in practical scenarios. Our results highlight key challenges in applying IRM to real-world scenarios, calling for a more naturalistic characterization of the problem setup for o.o.d generalization.

artificial intelligence, name change, proceedings, (5 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.83)

Technology: Information Technology > Artificial Intelligence (0.43)

Add feedback

Smaller Models, Smarter Rewards: A Two-Sided Approach to Process and Outcome Rewards

Groeneveld, Jan Niklas, Qin, Xi, Schaefer, Alexander, Oren, Yaad

arXiv.org Artificial IntelligenceDec-11-2025

Generating high-quality code remains a challenge for Large Language Models (LLMs). For the evolution of reasoning models on this task, reward models are a necessary intermediate step. These models judge outcomes or intermediate steps. Decoder-only transformer models can be turned into reward models by introducing a regression layer and supervised fine-tuning. While it is known that reflection capabilities generally increase with the size of a model, we want to investigate whether state-of-the-art small language models like the Phi-4 family can be turned into usable reward models blending the consideration of process rewards and outcome rewards. Targeting this goal, we construct a dataset of code samples with correctness labels derived from the APPS coding challenge benchmark. We then train a value-head model to estimate the success probability of intermediate outputs. Our evaluation shows that small LLMs are capable of serving as effective reward models or code evaluation critics, successfully identifying correct solutions among multiple candidates. Using this critic, we achieve over a 20% improvement in the search capability of the most accurate code out of multiple generations.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2510.23083

Country:

North America > United States > California > Orange County > Irvine (0.14)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Asia > China > Liaoning Province > Shenyang (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Reasoning Under Pressure: How do Training Incentives Influence Chain-of-Thought Monitorability?

MacDermott, Matt, Wei, Qiyao, Djoneva, Rada, Ward, Francis Rhys

arXiv.org Artificial IntelligenceDec-10-2025

AI systems that output their reasoning in natural language offer an opportunity for safety -- we can \emph{monitor} their chain of thought (CoT) for undesirable reasoning, such as the pursuit of harmful objectives. However, the extent to which CoT faithfully reflects the underlying reasoning process, and hence the extent to which it can be usefully monitored, may be influenced by certain aspects of training. We investigate how different \emph{training incentives}, applied to a reasoning model, affect its monitorability. We introduce a novel methodology for measuring monitorability according to whether a monitor can predict a key latent variable using the model's reasoning. When controlling for accuracy, we do not find evidence for consistent effects from commonly used incentives (length penalties and KL regularisation), but we find that adversarial optimisation (penalising monitor accuracy) degrades monitor performance, while direct optimisation for monitorability does not reliably lead to improvements. Our code is available at https://github.com/QiyaoWei/reasoning-under-pressure.

large language model, machine learning, monitorability, (21 more...)

arXiv.org Artificial Intelligence

2512.00218

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Monaco (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.65)

Add feedback